Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce APIError for Sync Calls from CLI to Orchestrator Node #4366

Merged
merged 18 commits into from
Sep 11, 2024

Conversation

udsamani
Copy link
Contributor

@udsamani udsamani commented Sep 8, 2024

This PR aims at introducing APIError, which is responsible for encapsulating all Synchronous error for requestes from CLI to Orchestrator Node.

We introduce a CustomHttpErrorHandler, which is responsible for translating errors to correct response status code. For example if a job does not exist, we should return 404 and not 500.

in addition, to this we also make sure that CLI outputs are clean outputs and not involving the HTTP Response codes. There is not value in exposing HTTP Status Codes on CLI as it adds no values.

CLI Outputs Before this PR Change

Screenshot 2024-09-09 at 00 02 29 Screenshot 2024-09-09 at 00 02 18 Screenshot 2024-09-09 at 00 01 43

CLI Outputs After this PR Change

Screenshot 2024-09-09 at 00 10 03 Screenshot 2024-09-09 at 00 10 16 Screenshot 2024-09-09 at 00 10 30

closes #4234

@udsamani udsamani requested review from wdbaruni and frrist September 8, 2024 23:11
cmd/cli/job/describe.go Outdated Show resolved Hide resolved
@@ -176,7 +177,7 @@ func (b *BoltJobStore) getJob(tx *bolt.Tx, jobID string) (models.Job, error) {

data := GetBucketData(tx, NewBucketPath(BucketJobs, jobID), SpecKey)
if data == nil {
return job, bacerrors.NewJobNotFound(jobID)
return job, apimodels.NewJobNotFound(jobID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't sound right that internal and backend components depend on apimodels which are meant for the http API between frontend (api server) and clients.

Some suggestions:

  1. We should be able to define internal errors separate from the APIs, and then implement interceptors that can translate internal errors to http errors
  2. At the same time we should avoid having to translate each individual error to http format. So maybe having an interface that each of these errors implement, or a common underlying error type can simplify the translation
  3. We should avoid having a central pkg that defines all internal errors in the system like what we have today with bacerror

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please let me know your thoughts about it.

Comment on lines 31 to 38
// httpstatuscode is the http status code associated with this error.
// it should correspond to standard http status codes (e.g., 400, 404, 500).
HTTPStatusCode int `json:"code"`

// message is a short, human-readable description of the error.
// it should be concise and provide a clear indication of what went wrong.
Message string `json:"message"`
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also return:

  • RequestID to help debug and grep server logs for the failed request
  • ErrorCode that uniquely identify the error error without having to parse the message, similar to AWS

Keep in mind that some of this information are already in the http headers, such as status code and request id. It might be redundant to serialize them in the APIError. Maybe we should omit serializing them, and just have the client set those values from the headers.

Copy link
Contributor Author

@udsamani udsamani Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the RequestID. However, adding a bacalhau specific error code is tricky at this point. Mainly because we are need to make sure that we properly propagate error codes correctly every where. In addition, we need to think about how to formulate error codes for errors that occur before even we reach to bacalhau specific components, for example echo server middlewares.

I think we should pick that up in follow ups. Let me know what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah please feel free to do this in a follow up PR. What I had in mind is something similar to the code below. The idea is to have standard error codes that can be used by our core components and even the modules if the errors apply, while still enabling the modules to define their own specific error code, such as BucketNotFound and ContainerStartFailed

The HTTPCode can be explicitly provided, inferred from the error code if it is a standard one, or fallback to 500.

What I really liked in your PR is the introduction of Component field which helps identify the source of the error and also removes the need for these error codes to be globally unique. I would just make the components more readable and descriptive, such as S3Publisher, Docker instead of abbreviations. To simplify things, I am proposing the default component to be Bacalhau, which doesn't mean much, but will allow us to incrementally improve the coverage of better errors.

Also as we progress, we should allow components to override the Component piece of the errors. Meaning a job not found error with JobStore component doesn't help much compared to knowing the component that actually called the job store. We don't need to do these changes today, but just thinking ahead.

What do you think?

const (
    BadRequest ErrorCode = "BadRequest"
    Unauthorized ErrorCode = "Unauthorized"
    ResourceNotFound ErrorCode = "ResourceNotFound"
    ResourceInUse ErrorCode = "ResourceInUse"
    ValidationFailed ErrorCode = "ValidationFailed"
    VersionMismatch ErrorCode = "VersionMismatch"
    TooManyRequests ErrorCode = "TooManyRequests"
    
    InternalError ErrorCode = "InternalError"
    NotImplemented ErrorCode = "NotImplemented"
    ServiceUnavailable ErrorCode = "ServiceUnavailable"
    DatastoreFailure ErrorCode = "DatastoreFailure"
    NetworkFailure ErrorCode = "NetworkFailure"
    ConfigurationError ErrorCode = "ConfigurationError"
    ResourceExhausted ErrorCode = "ResourceExhausted"
)

// Note: Component-specific error codes should be defined in their respective packages.
// Examples might include:
// - BucketNotFound
// - ObjectNotFound
// - InvalidCredentials
// - ImageNotFound
// - ContainerStartFailed
// - PeerConnectionFailed

type BaseError struct {
    Code           ErrorCode         `json:"code"`
    Message        string            `json:"message"`
    Component      string            `json:"component"`
    HTTPCode       int               `json:"http_code"`
    Hint           string            `json:"hint,omitempty"`
    Retryable      bool              `json:"retryable"`
    FailsExecution bool              `json:"fails_execution"`
    Details        map[string]string `json:"details,omitempty"`
}

func NewBaseError(format string, a ...any) *BaseError {
    return &BaseError{
        Code:      InternalError,
        Message:   fmt.Sprintf(format, a...),
        Component: "Bacalhau",
        HTTPCode:  0, // Will be inferred later if not set explicitly
    }
}

func (e *BaseError) Error() string {
    return fmt.Sprintf("[%s] %s: %s", e.Code, e.Component, e.Message)
}

func (e *BaseError) WithErrorCode(code ErrorCode) *BaseError {
    e.Code = code
    return e
}

func (e *BaseError) WithHTTPCode(httpCode int) *BaseError {
    e.HTTPCode = httpCode
    return e
}

func (e *BaseError) HTTPStatusCode() int {
    if e.HTTPCode != 0 {
        return e.HTTPCode
    }
    return inferHTTPStatusCode(e.Code)
}

func inferHTTPStatusCode(code ErrorCode) int {
    switch code {
    case BadRequest, ValidationFailed:
        return http.StatusBadRequest
    case Unauthorized:
        return http.StatusUnauthorized
    case ResourceNotFound:
        return http.StatusNotFound
    case ResourceInUse, VersionMismatch:
        return http.StatusConflict
    case TooManyRequests:
        return http.StatusTooManyRequests
    case NotImplemented:
        return http.StatusNotImplemented
    case ServiceUnavailable:
        return http.StatusServiceUnavailable
    default:
        return http.StatusInternalServerError
    }
}

// rest of the methods

pkg/publicapi/endpoint/orchestrator/job.go Outdated Show resolved Hide resolved
pkg/publicapi/endpoint/orchestrator/job.go Outdated Show resolved Hide resolved
pkg/publicapi/middleware/error_handler.go Outdated Show resolved Hide resolved
@udsamani udsamani requested a review from wdbaruni September 11, 2024 01:56
@udsamani udsamani self-assigned this Sep 11, 2024
Comment on lines 74 to 75
func NewBoltDbError(message string) *models.BaseError {
return models.NewBaseError(message).WithCode(models.NewErrorCode(BOLTDB_COMPONENT, 500))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should fit in boltdb pkg as it is related to the job store implementation. I understand I asked to define and use errors mainly in the jobstore pkg, but that should be for common errors that apply to different implementations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 38 to 40

// RequestID is the request ID of the request that caused the error.
RequestID string `json:"request_id"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you might want to define RequestID and HTTPStatusCode as transient fields so you can avoid serializing them and then populate them in the client from the headers. This is doable by using json:- tag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 66 to 68
func isDebugMode() bool {
return os.Getenv("BACALHAU_DEBUG") == "true"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have some logic in enabling debug mode in our http server, but they are not consistent yet, and maybe we should unify them and avoid adding a third mode

// enable debug mode to get clearer error messages
// TODO: disable debug mode after we implement our own error handler
server.Router.Debug = true
// set middleware
logLevel, err := zerolog.ParseLevel(params.Config.LogLevel)
if err != nil {
return nil, err
}

I would say we can just rely on the LogLevel configuration, and if it is debug or trace, then we set Router.Debug = true

In your error handler you can then do if c.Echo.Debug == true to check if debug mode is enabled in the router

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -190,7 +191,7 @@ func (b *BoltJobStore) reifyJobID(tx *bolt.Tx, jobID string) (string, error) {
if idgen.ShortUUID(jobID) == jobID {
bktJobs, err := NewBucketPath(BucketJobs).Get(tx, false)
if err != nil {
return "", err
return "", jobstore.NewBoltDbError(err.Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would suggest to pass the actual error instead of just the string. This will allow NewBoltDbError to intercept the error and return different types or codes based on the actual error.

My not be needed here or now, but I am just thinking ahead when we capture and throw errors for docker executor and s3 publisher for example. You might find this pattern useful to centralize intercepting errors per component

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@udsamani udsamani requested a review from wdbaruni September 11, 2024 12:35
@udsamani udsamani merged commit 681eae8 into main Sep 11, 2024
4 checks passed
@udsamani udsamani deleted the job-not-found branch September 11, 2024 14:32
frrist pushed a commit that referenced this pull request Sep 17, 2024
wdbaruni added a commit that referenced this pull request Sep 19, 2024
- introduced by #4366, we need to close the body only if there isn't an
error, else the body is nil and panics. Mentioned here
#4401 (comment)

Co-authored-by: frrist <[email protected]>
Co-authored-by: Walid Baruni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When describing a job that doesn't exist bacalhau returns a 500 error instead of expected 404
2 participants